State-of-the-art methods for 3D hand pose estimation from depth images require large amounts of annotated training data. We propose to model the statistical relationships between 3D hand poses and corresponding depth images using two deep generative models with a shared latent space. By design, our architecture allows learning from unlabeled image data in a semi-supervised manner. Assuming a one-to-one mapping between a pose and a depth map, any given point in the shared latent space can be projected into both a hand pose and a corresponding depth map. Regressing the hand pose can then be done by learning a discriminator to estimate the posterior of the latent pose given a depth map. To improve generalization and to better exploit unlabeled depth maps, we jointly train a generator and a discriminator. At each iteration, the generator is updated with the back-propagated gradient from the discriminator to synthesize realistic depth maps of the articulated hand, while the discriminator benefits from an augmented training set of synthesized and unlabeled samples. The proposed discriminator network architecture is highly efficient and runs at 90 FPS on the CPU, with accuracy comparable to or better than state-of-the-art methods on three publicly available benchmarks.
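The shared-latent-space design above can be illustrated with a toy sketch: a single latent point decodes into both a pose and a depth map, and a posterior estimator maps a depth map back to the latent space, from which the pose is decoded. All dimensions and the linear "networks" here are hypothetical stand-ins for the deep generators and discriminator, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): latent code, pose as
# 21 joints x 3 coordinates, and a flattened 32x32 depth map.
D_LATENT, D_POSE, D_DEPTH = 23, 63, 32 * 32

# Toy linear "generators" sharing the latent space; a real system would
# use deep generative models trained jointly.
W_pose = rng.standard_normal((D_POSE, D_LATENT)) * 0.1
W_depth = rng.standard_normal((D_DEPTH, D_LATENT)) * 0.1

def gen_pose(z):
    return W_pose @ z        # latent point -> hand pose

def gen_depth(z):
    return W_depth @ z       # same latent point -> depth map

# Toy "discriminator"/posterior estimator: maps an observed depth map
# back to a latent code, then decodes the pose from it.
W_disc = rng.standard_normal((D_LATENT, D_DEPTH)) * 0.1

def estimate_pose(depth):
    z_hat = W_disc @ depth   # estimated latent pose given the depth map
    return gen_pose(z_hat)

z = rng.standard_normal(D_LATENT)
depth = gen_depth(z)         # a synthesized depth map (could augment training)
pose = estimate_pose(depth)  # regressed hand pose
print(pose.shape)            # (63,)
```

Because both generators decode the same latent point, synthesized depth maps come with poses "for free", which is what lets unlabeled and synthesized samples augment the discriminator's training set.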